nycflights13
Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load the tidyverse package.
library(tidyverse)
Enter your code chunks for Section 10.2 here.
as_tibble(iris)
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
Describe what each code chunk does.
#1 Coerced the data frame iris to a tibble.
#2 Made a tibble from individual vectors; y = 1 is recycled to length 5 and z is computed from x and y.
#3 Created a tibble with non-syntactic column names.
#4 Created a tibble from a very small amount of data, entered row by row with tribble().
### 10.3: Tibbles vs data.frame
Enter your code chunks for Section 10.3 here.
#3.2.1
#This creates a tibble with 1000 rows and 5 columns.
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
#Prints the first 10 rows and all columns (width = Inf removes the column limit).
nycflights13::flights %>%
print(n = 10, width = Inf)
#This opens the package documentation for tibble.
package?tibble
#Opens a new tab with data.
nycflights13::flights %>%
View()
#3.2.2
#This makes a tibble from the variables x and y.
df <- tibble(x = runif(5), y = rnorm(5))
#extracts a variable
df$x
[1] 0.2726624 0.9823032 0.3205767 0.8882099 0.1346611
#This does the exact same thing as the code above.
df[["x"]]
[1] 0.2726624 0.9823032 0.3205767 0.8882099 0.1346611
#This extracts data by position.
df[[1]]
[1] 0.2726624 0.9823032 0.3205767 0.8882099 0.1346611
#Uses a placeholder in a pipe.
df %>% .$x
[1] 0.2726624 0.9823032 0.3205767 0.8882099 0.1346611
#This does the exact same thing as the code above.
df %>% .[["x"]]
[1] 0.2726624 0.9823032 0.3205767 0.8882099 0.1346611
#Thanks Dr. T
#### Section 10.5 Questions
Answer the questions completely. Use code chunks, text, or both, as necessary.
1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble. Hint: What does as_tibble() do? What does class() do? What does str() do?
#Using class() or str() will tell you whether something is a tibble: class() returns "tbl_df" "tbl" "data.frame" for a tibble. Printing also differs: a tibble prints a "# A tibble" header with its dimensions, the column types, and only the first 10 rows, while a data frame prints every row.
mtcars
as_tibble(mtcars)
class(mtcars)
[1] "data.frame"
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
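The checks used above can be gathered into one short sketch. It assumes the tibble package is installed; `mt_tbl` is a name made up for this illustration.

```r
library(tibble)

# mtcars is a plain data frame; as_tibble() returns a tibble copy of it.
mt_tbl <- as_tibble(mtcars)

# class() shows the extra tibble classes in front of "data.frame".
class(mtcars)
class(mt_tbl)

# is_tibble() gives a direct TRUE/FALSE answer.
is_tibble(mtcars)
is_tibble(mt_tbl)
```

Because a tibble still has the class "data.frame", is.data.frame() returns TRUE for both objects, so it cannot tell them apart.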
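2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
#With a data.frame, df$x partially matches the column xyz, so a typo can silently return the wrong column. df[, "xyz"] returns a vector while df[, c("abc", "xyz")] returns a data frame, so the type of the result depends on how many columns are selected. A tibble never partially matches, and [ always returns another tibble.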
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] a
Levels: a
df[, "xyz"]
[1] a
Levels: a
df[, c("abc", "xyz")]
df <- tibble(abc = 1, xyz = "a")
df$abc
[1] 1
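The two behaviours that differ can be compared side by side. This is a minimal sketch, assuming the tibble package is installed; `df_base` and `df_tbl` are names made up for this illustration so that neither object overwrites the other.

```r
library(tibble)

df_base <- data.frame(abc = 1, xyz = "a")
df_tbl  <- tibble(abc = 1, xyz = "a")

# `$` on a data frame partially matches column names: x finds xyz.
df_base$x

# A tibble does not partially match; it returns NULL with a warning.
suppressWarnings(df_tbl$x)

# Selecting one column with `[` drops to a vector for a data frame ...
is.data.frame(df_base[, "xyz"])

# ... but stays a tibble for a tibble.
is_tibble(df_tbl[, "xyz"])
```

The warning is suppressed here only to keep the output short; in real work it is the signal that a column name was mistyped.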
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
read_csv("a,b,c
1,2,3
4,5,6")
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
read_csv("1,2,3\n4,5,6", col_names = FALSE)
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
read_csv("a,b,c\n1,2,.", na = ".")
1: What function would you use to read a file where fields were separated with “|”?
#You should use read_delim() with delim = "|", because it can read files with any delimiter.
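A small sketch of reading "|"-separated fields, assuming the readr package is installed. The inline data and the names `pipe_data` and `pipe_tbl` are made up for this illustration; a string that contains a newline is read as literal data rather than as a file path.

```r
library(readr)

# Hypothetical inline data with "|" between fields.
pipe_data <- "a|b|c\n1|2|3\n4|5|6"

# delim tells read_delim() which character separates the fields.
pipe_tbl <- read_delim(pipe_data, delim = "|")
pipe_tbl
```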
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file.csv", delim = ",")
file <- read_delim("file.csv", delim = "\t")
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
3: What are the two most important arguments to read_fwf()? Why?
#col_positions and col_types, because col_positions says where each column begins and ends (and can name the columns), and col_types specifies what type of data goes in each column.
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
table4a
#The first has two column headers but three values in each row. The second has three headers, but its first row has only two values and its second row has four. The third has two headers but only one value, and the opening quote is never closed. The fourth mixes numbers and letters in the same columns, so both columns are read as character. The last is separated by semicolons where read_csv() expects commas; read_csv2() reads semicolon-separated data.
### 11.3 and 11.4: Not required
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
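As a sketch of that difference, the example below writes a small data frame and reads it back. It assumes the readr package is installed; `scores`, `out_file`, and `scores_back` are names made up for this illustration, and the file goes to a temporary location rather than a real project folder.

```r
library(readr)

# Hypothetical example data and a temporary file path.
scores <- data.frame(name = c("a", "b"), score = c(2, 1))
out_file <- tempfile(fileext = ".csv")

# Writing needs the object to save first, then the path.
write_csv(scores, out_file)

# Reading it back only needs the path.
scores_back <- read_csv(out_file)
scores_back
```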
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above it. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
table2
table2 %>%
spread(key = type, value = count)
2: Why does this code fail? Fix it so it works.
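# Fails: without backticks, 1999 and 2000 are read as column positions.
# table4a %>%
# gather(1999, 2000, key = "year", value = "cases")
# Works: backticks mark the years as column names.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")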
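#The years need backticks around them because 1999 and 2000 are non-syntactic names; without backticks gather() treats them as column positions.
That is all for Chapter 12. On to the last chapter.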
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
library(tidyverse)
nycflights13::flights
flights
filter(flights, month == 1, day == 1)
filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day == 25))
filter(flights, month == 1)
sqrt(2) ^ 2 == 2
[1] FALSE
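The FALSE above comes from floating-point rounding, not from a mistake in the arithmetic. The sketch below shows two ways to compare with a tolerance instead of ==. all.equal() is base R; near() comes from dplyr, so the last line assumes dplyr is installed.

```r
# sqrt(2)^2 differs from 2 by a tiny rounding error.
sqrt(2)^2 - 2

# all.equal() compares within a tolerance; isTRUE() turns the
# result into a single TRUE or FALSE.
isTRUE(all.equal(sqrt(2)^2, 2))

# dplyr::near() does the same job and works inside filter().
dplyr::near(sqrt(2)^2, 2)
```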
filter(flights, month == 11 | month == 12)
filter()
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
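The last two filters return the same rows because !(a | b) is the same as !a & !b (De Morgan's law). A minimal base R sketch on made-up delay values, with hypothetical names `keep_a` and `keep_b`, shows the two conditions agree element by element.

```r
# Hypothetical delays in minutes for five flights.
arr_delay <- c(10, 150, 30, 200, 0)
dep_delay <- c(5, 20, 130, 180, 0)

# "Not (late arriving or late departing)" ...
keep_a <- !(arr_delay > 120 | dep_delay > 120)

# ... equals "not late arriving and not late departing".
keep_b <- arr_delay <= 120 & dep_delay <= 120

keep_a
identical(keep_a, keep_b)
```

Only the first and last flights pass both tests, since each of the other three has at least one delay above 120 minutes.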
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
NA / 2
[1] NA
NA == NA
[1] NA
x <- NA
y <- NA
x == y
[1] NA
is.na(x)
[1] TRUE
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
filter(flights, arr_delay >= 120 | dep_delay >= 120)
1.1: Find all flights with a delay of 2 hours or more.
filter(flights, dest == 'IAH' | dest == 'HOU')
1.2: Flew to Houston (IAH or HOU)
filter(flights, carrier == 'UA' | carrier == 'AA' | carrier == 'DL')
1.3: Were operated by United (UA), American (AA), or Delta (DL).
filter(flights, month >= 7 & month <= 9)
1.4: Departed in summer (July, August, and September).
filter(flights, arr_delay > 120, dep_delay <= 0)
1.5: Arrived more than two hours late, but didn’t leave late.
filter(flights, dep_delay >= 60, dep_delay-arr_delay > 30)
1.6: Were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best.
filter(flights, dep_time <=600 | dep_time == 2400)
1.7: Departed between midnight and 6am (inclusive)
filter(flights, between(month, 7, 9))
filter(flights, !between(dep_time, 601, 2359))
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
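A short sketch of what between() computes, assuming dplyr is installed. The `month` vector holds made-up values so the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical month values.
month <- c(1, 6, 7, 8, 9, 10)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so both ends are included.
between(month, 7, 9)
identical(between(month, 7, 9), month >= 7 & month <= 9)
```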
summary(flights)
year month
Min. :2013 Min. : 1.000
1st Qu.:2013 1st Qu.: 4.000
Median :2013 Median : 7.000
Mean :2013 Mean : 6.549
3rd Qu.:2013 3rd Qu.:10.000
Max. :2013 Max. :12.000
day dep_time
Min. : 1.00 Min. : 1
1st Qu.: 8.00 1st Qu.: 907
Median :16.00 Median :1401
Mean :15.71 Mean :1349
3rd Qu.:23.00 3rd Qu.:1744
Max. :31.00 Max. :2400
NA's :8255
sched_dep_time dep_delay
Min. : 106 Min. : -43.00
1st Qu.: 906 1st Qu.: -5.00
Median :1359 Median : -2.00
Mean :1344 Mean : 12.64
3rd Qu.:1729 3rd Qu.: 11.00
Max. :2359 Max. :1301.00
NA's :8255
arr_time sched_arr_time
Min. : 1 Min. : 1
1st Qu.:1104 1st Qu.:1124
Median :1535 Median :1556
Mean :1502 Mean :1536
3rd Qu.:1940 3rd Qu.:1945
Max. :2400 Max. :2359
NA's :8713
arr_delay carrier
Min. : -86.000 Length:336776
1st Qu.: -17.000 Class :character
Median : -5.000 Mode :character
Mean : 6.895
3rd Qu.: 14.000
Max. :1272.000
NA's :9430
flight tailnum
Min. : 1 Length:336776
1st Qu.: 553 Class :character
Median :1496 Mode :character
Mean :1972
3rd Qu.:3465
Max. :8500
origin dest
Length:336776 Length:336776
Class :character Class :character
Mode :character Mode :character
air_time distance
Min. : 20.0 Min. : 17
1st Qu.: 82.0 1st Qu.: 502
Median :129.0 Median : 872
Mean :150.7 Mean :1040
3rd Qu.:192.0 3rd Qu.:1389
Max. :695.0 Max. :4983
NA's :9430
hour minute
Min. : 1.00 Min. : 0.00
1st Qu.: 9.00 1st Qu.: 8.00
Median :13.00 Median :29.00
Mean :13.18 Mean :26.23
3rd Qu.:17.00 3rd Qu.:44.00
Max. :23.00 Max. :59.00
time_hour
Min. :2013-01-01 05:00:00
1st Qu.:2013-04-04 13:00:00
Median :2013-07-03 10:00:00
Mean :2013-07-03 05:22:54
3rd Qu.:2013-10-01 07:00:00
Max. :2013-12-31 23:00:00
#between(x, left, right) keeps the values of x from left to right, inclusive. It makes the summer-months and midnight-to-6am questions much easier.
3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
NA ^ 0
[1] 1
NA | TRUE
[1] TRUE
FALSE & NA
[1] FALSE
#dep_time is missing for 8255 flights, and dep_delay is missing for the same number. arr_time is missing for 8713, and arr_delay and air_time are each missing for 9430. These rows probably represent cancelled flights.
4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
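One way to state the general rule: the result is not missing when it would be the same for every value the NA could stand for. The base R sketch below walks through the three cases and the counterexample.

```r
# The result is known whenever it is the same for every possible value.
NA^0          # any number to the power 0 is 1
NA | TRUE     # "x or TRUE" is TRUE whatever x is
FALSE & NA    # "FALSE and x" is FALSE whatever x is

# NA * 0 stays NA because the answer depends on the value:
# a finite number times 0 is 0, but Inf * 0 is NaN.
NA * 0
Inf * 0
```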
arrange(flights, year, month, day)
Note: For some context, see this thread
arrange()
arrange(flights, desc(dep_delay))
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
arrange(df, desc(is.na(x)))
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do?
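A sketch of the two-key sort the note asks for, assuming dplyr is installed. `df_na` and `sorted` are names made up for this illustration; on flights the same pattern would use dep_time in place of x.

```r
library(dplyr)

# Hypothetical small tibble with one missing value.
df_na <- tibble(x = c(5, 2, NA))

# is.na(x) is TRUE for missing rows, and desc() puts TRUE before FALSE.
# The second key then sorts the remaining rows in ascending order.
sorted <- arrange(df_na, desc(is.na(x)), x)
sorted
```

The first key only decides whether a row is missing or not, so the second key is what keeps the smallest values directly after the NAs.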
arrange(flights, desc(dep_delay))
arrange(flights, dep_delay)
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
arrange(flights, air_time)
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air.
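# air_time is in minutes, so divide by 60 to get miles per hour; desc() puts the fastest first.
arrange(flights, desc(distance / (air_time / 60)))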
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
arrange(flights, distance)
4: Which flights travelled the longest? Which travelled the shortest?
arrange(flights, desc(distance))
select()
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
rename(flights, departuretime = dep_time)
?select
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, starts_with("dep"), starts_with("arr"))
select(flights, one_of(vars))
select(flights, dep_time, dep_delay, arr_time, arr_delay)
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
select(flights, dest, origin, dest, dest)
2: What happens if you include the name of a variable multiple times in a select() call?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
#There doesn't seem to be any problem with this; select() keeps the variable only once.
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
#one_of() selects the variables whose names are listed in a character vector, so the names can be stored once in vars and reused.
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
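select(flights, contains("TIME", ignore.case = FALSE))
#contains() picks out all variables whose names include the given text. The select helpers ignore case by default, so contains("TIME") still matches dep_time and the other time columns. Setting ignore.case = FALSE makes the match case-sensitive.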
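The case handling can be checked on a tiny example. This sketch assumes dplyr is installed; `tiny` is a made-up tibble whose lower-case column names mimic those in flights.

```r
library(dplyr)

# Hypothetical tibble with lower-case column names, like flights.
tiny <- tibble(dep_time = 517, arr_time = 830, carrier = "UA")

# Default: the match ignores case, so "TIME" finds both *_time columns.
names(select(tiny, contains("TIME")))

# Case-sensitive match: no column name contains upper-case "TIME".
names(select(tiny, contains("TIME", ignore.case = FALSE)))
```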